Exploratory Data Analysis Basics#

EDA Definition#

drawing

Razones para hacer un EDA#

  • Organizar y entender las variables

  • Establecer relaciones entre las variables

  • Encontrar patrones ocultos en los datos

  • Ayuda a escoger el modelo correcto para la necesidad correcta

  • Tomar decisiones informadas

Pasos de EDA#

drawing

Tipos de Analitica de datos#

drawing

Project#

Environment Setup#

Library installation#

!pip install --upgrade pip

!pip install palmerpenguins==0.1.4 numpy==1.23.4 pandas==1.5.1 seaborn==0.12.1 matplotlib==3.6.0 empiricaldist==0.6.7 statsmodels==0.13.5 scikit-learn==1.1.2 pyjanitor==0.23.1 session-info
Requirement already satisfied: pip in c:\users\admin\miniconda3\envs\jupyter_notebooks_from_deepnote\lib\site-packages (23.2.1)
^C
Collecting palmerpenguins==0.1.4
  Using cached palmerpenguins-0.1.4-py3-none-any.whl (17 kB)
Collecting numpy==1.23.4
  Using cached numpy-1.23.4-cp311-cp311-win_amd64.whl (14.6 MB)
Collecting pandas==1.5.1
  Using cached pandas-1.5.1-cp311-cp311-win_amd64.whl (10.3 MB)
Collecting seaborn==0.12.1
  Using cached seaborn-0.12.1-py3-none-any.whl (288 kB)
Collecting matplotlib==3.6.0
  Using cached matplotlib-3.6.0-cp311-cp311-win_amd64.whl (7.2 MB)
Collecting empiricaldist==0.6.7
  Using cached empiricaldist-0.6.7.tar.gz (11 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting statsmodels==0.13.5
  Using cached statsmodels-0.13.5-cp311-cp311-win_amd64.whl (9.0 MB)
Collecting scikit-learn==1.1.2
  Using cached scikit-learn-1.1.2.tar.gz (7.0 MB)
  Installing build dependencies: started

Library import#

import empiricaldist
import janitor
import matplotlib.pyplot as plt
import numpy as np
import palmerpenguins
import pandas as pd
import scipy.stats
import seaborn as sns
import sklearn.metrics
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats as ss
import session_info

Stablish Chart Settings#

%matplotlib inline
sns.set_style(style='whitegrid')
sns.set_context(context='notebook')
plt.rcParams['figure.figsize'] = (11, 9.4)

penguin_color = {
    'Adelie': '#ff6602ff',
    'Gentoo': '#0f7175ff',
    'Chinstrap': '#c65dc9ff'
}

Importing Dataset#

The dataset chosen for this project is the palmerpenguins dataset

The goal of the Palmer Penguins dataset is to replace the highly overused Iris dataset for data exploration & visualization.

The information contains size measurements, clutch observations, and blood isotope ratios for 344 adult foraging Adélie, Chinstrap, and Gentoo penguins observed on islands in the Palmer Archipelago near Palmer Station, Antarctica.

Data were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica Long Term Ecological Research (LTER) Program.

Raw Data#

This package contains a raw dataset (literally contains the data as it was collected)

raw_df = palmerpenguins.load_penguins_raw()

raw_df
studyName Sample Number Species Region Island Stage Individual ID Clutch Completion Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm) Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo) Comments
0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 Yes 2007-11-11 39.1 18.7 181.0 3750.0 MALE NaN NaN Not enough blood for isotopes.
1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 Yes 2007-11-11 39.5 17.4 186.0 3800.0 FEMALE 8.94956 -24.69454 NaN
2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1 Yes 2007-11-16 40.3 18.0 195.0 3250.0 FEMALE 8.36821 -25.33302 NaN
3 PAL0708 4 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A2 Yes 2007-11-16 NaN NaN NaN NaN NaN NaN NaN Adult not sampled.
4 PAL0708 5 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N3A1 Yes 2007-11-16 36.7 19.3 193.0 3450.0 FEMALE 8.76651 -25.32426 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
339 PAL0910 64 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N98A2 Yes 2009-11-19 55.8 19.8 207.0 4000.0 MALE 9.70465 -24.53494 NaN
340 PAL0910 65 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N99A1 No 2009-11-21 43.5 18.1 202.0 3400.0 FEMALE 9.37608 -24.40753 Nest never observed with full clutch.
341 PAL0910 66 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N99A2 No 2009-11-21 49.6 18.2 193.0 3775.0 MALE 9.46180 -24.70615 Nest never observed with full clutch.
342 PAL0910 67 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N100A1 Yes 2009-11-21 50.8 19.0 210.0 4100.0 MALE 9.98044 -24.68741 NaN
343 PAL0910 68 Chinstrap penguin (Pygoscelis antarctica) Anvers Dream Adult, 1 Egg Stage N100A2 Yes 2009-11-21 50.2 18.7 198.0 3775.0 FEMALE 9.39305 -24.25255 NaN

344 rows × 17 columns

processed data#

Dataset containing already processed data

df = palmerpenguins.load_penguins()

df
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

344 rows × 8 columns

Using the seaborn dataset#

Seaborn also contains this dataset, let’s see it:

sns.load_dataset("penguins")
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

Performing EDA#

Now let’s answer some questions, remember that the name of our dataframe is stored in df

What are the variables types for this dataset?#


You can use pandas attribute dtype to do this:

df.dtypes
species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
year                   int64
dtype: object
  • “object” is usually synonymous of categorical variables

  • “float64” is usually synonymous of Numerical continuous variables

  • “int64” is usually synonymous of Numerical discrete variables

How many variables of each type is present in the dataset?#


The value_counts() attribute can help

(
    df
    .dtypes
    .value_counts()
)
float64    4
object     3
int64      1
dtype: int64

How many variables and records are in the dataset?#

The .shape() function can help with this

df.shape
(344, 8)
  • 344 rows

  • 8 columns

That answers the question

Are there null values in the dataset?#

  • .isnull() returns boolean values

  • any() tells you if null values are present in a column

See the result below:

df.isnull().any()
species              False
island               False
bill_length_mm        True
bill_depth_mm         True
flipper_length_mm     True
body_mass_g           True
sex                   True
year                 False
dtype: bool

How many null values are there in each column?#

use .sum() to obtain the results

df.isnull().sum()
species               0
island                0
bill_length_mm        2
bill_depth_mm         2
flipper_length_mm     2
body_mass_g           2
sex                  11
year                  0
dtype: int64

What is the total amount of null values in the dataset?#

call .sum() again in the previous result

df.isnull().sum().sum()
19

What is the null values proportion for each variable?#

let’s show it in percentage

( df.isnull().sum()/(df.shape[0]) )*100
species              0.000000
island               0.000000
bill_length_mm       0.581395
bill_depth_mm        0.581395
flipper_length_mm    0.581395
body_mass_g          0.581395
sex                  3.197674
year                 0.000000
dtype: float64

You can actually make a chart to represent this:

melted = df.isnull().melt()

melted
variable value
0 species False
1 species False
2 species False
3 species False
4 species False
... ... ...
2747 year False
2748 year False
2749 year False
2750 year False
2751 year False

2752 rows × 2 columns

melted.pipe(
    lambda x : sns.displot(data=x, y="variable", hue="value", multiple="fill", aspect=2)
)
<seaborn.axisgrid.FacetGrid at 0x7fa235ed23a0>
../_images/b8bbef47af0f20141c8afaecb45367bc3331c5c9e56918ecd55f354bddc9a9aa.png

Los valores nulos no llegan ni al 10%

What are the records that contain the null values in the dataset? (Visualize)#

First create the dataframe

nulls = df.isnull().transpose()
nulls
0 1 2 3 4 5 6 7 8 9 ... 334 335 336 337 338 339 340 341 342 343
species False False False False False False False False False False ... False False False False False False False False False False
island False False False False False False False False False False ... False False False False False False False False False False
bill_length_mm False False False True False False False False False False ... False False False False False False False False False False
bill_depth_mm False False False True False False False False False False ... False False False False False False False False False False
flipper_length_mm False False False True False False False False False False ... False False False False False False False False False False
body_mass_g False False False True False False False False False False ... False False False False False False False False False False
sex False False False True False False False False True True ... False False False False False False False False False False
year False False False False False False False False False False ... False False False False False False False False False False

8 rows × 344 columns

Now the chart

plt.figure(figsize=(7, 7))
nulls.pipe(
    lambda x : sns.heatmap(data=x)
)

# Alternatively, use:
#df.isnull().transpose().pipe(lambda x : sns.heatmap(data=x))
<AxesSubplot: >
../_images/7cc0c5f8be1cd12ff2a5dc36cdaf4d6fdac3a7fba992c6740fb70d7df575a254.png

Here you can evidence:

  • The missing values (Not corresponding to sex) are associated with only two records

  • The sex missing values are consistant, we should not worry about that

  • You might want to delete those two records that contain the vast majority of missing values (Not corresponding to sex)

How many records will be lost if we delete all the missing values?#

df.dropna()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007
... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

333 rows × 8 columns

Note that now you have 333 rows, instead of the original 344

THIS COMMAND DOES NOT MODIFY YOUR ORIGINAL DATABASE!!!!

df.shape
(344, 8)

DO YOU SEE?

SO, LET’S SAVE IT into another variable ;)

df_clean = df.dropna()

df_clean
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 male 2007
... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

333 rows × 8 columns

Counts and Proportions#

What measures describe our dataset?#

The function .describe() can do this for us, but it will just do it for numerical variables.

To include strings use .describe(include=”all”)

df.describe(include="all")
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
count 344 344 342.000000 342.000000 342.000000 342.000000 333 344.000000
unique 3 3 NaN NaN NaN NaN 2 NaN
top Adelie Biscoe NaN NaN NaN NaN male NaN
freq 152 168 NaN NaN NaN NaN 168 NaN
mean NaN NaN 43.921930 17.151170 200.915205 4201.754386 NaN 2008.029070
std NaN NaN 5.459584 1.974793 14.061714 801.954536 NaN 0.818356
min NaN NaN 32.100000 13.100000 172.000000 2700.000000 NaN 2007.000000
25% NaN NaN 39.225000 15.600000 190.000000 3550.000000 NaN 2007.000000
50% NaN NaN 44.450000 17.300000 197.000000 4050.000000 NaN 2008.000000
75% NaN NaN 48.500000 18.700000 213.000000 4750.000000 NaN 2009.000000
max NaN NaN 59.600000 21.500000 231.000000 6300.000000 NaN 2009.000000

What if i want only numerical variables?

df.describe(include=[np.number])
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g year
count 342.000000 342.000000 342.000000 342.000000 344.000000
mean 43.921930 17.151170 200.915205 4201.754386 2008.029070
std 5.459584 1.974793 14.061714 801.954536 0.818356
min 32.100000 13.100000 172.000000 2700.000000 2007.000000
25% 39.225000 15.600000 190.000000 3550.000000 2007.000000
50% 44.450000 17.300000 197.000000 4050.000000 2008.000000
75% 48.500000 18.700000 213.000000 4750.000000 2009.000000
max 59.600000 21.500000 231.000000 6300.000000 2009.000000

What if i want only categorical variables?

df.describe(include=object)
species island sex
count 344 344 333
unique 3 3 2
top Adelie Biscoe male
freq 152 168 168

BONUS: Transform an object into a category#

See how to transform a column with the data type “object into a category”

This allows cool things such as see the proportions

df.astype({
        "species":"category",       #just insert a dict where key is column name and value is the type
        "island":"category",
        "sex":"category"
    })
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007
3 Adelie Torgersen NaN NaN NaN NaN NaN 2007
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 female 2007
... ... ... ... ... ... ... ... ...
339 Chinstrap Dream 55.8 19.8 207.0 4000.0 male 2009
340 Chinstrap Dream 43.5 18.1 202.0 3400.0 female 2009
341 Chinstrap Dream 49.6 18.2 193.0 3775.0 male 2009
342 Chinstrap Dream 50.8 19.0 210.0 4100.0 male 2009
343 Chinstrap Dream 50.2 18.7 198.0 3775.0 female 2009

344 rows × 8 columns

How to visualize counts?#

Pandas

plt.figure(figsize=(6, 6))
df.species.value_counts().plot(kind="bar");
../_images/95f13fb66821ed36d730a11c9a3a454d127aff0abfb4577bdafdc550e944b71e.png

Seaborn

sns.catplot(data=df, x="species", kind="count", palette=penguin_color);
../_images/ab886059541f40cca83212b1edebb49845ef0080f314ff9f4a549fbe798959af.png

How to visualize proportions#

Seaborn

df.add_column("x", "").pipe(
    lambda df: (
        sns.displot(data=df, x="x", hue="species", multiple="fill", palette=penguin_color)
    )
);
../_images/d59cba64c1475a939fc47c037d45b8245f04b4c42b540317a4ad7288c757eadd.png

Measures of central tendency#

Mean#

for all the variables

df.mean()
/tmp/ipykernel_1302/3698961737.py:1: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.mean()
bill_length_mm         43.921930
bill_depth_mm          17.151170
flipper_length_mm     200.915205
body_mass_g          4201.754386
year                 2008.029070
dtype: float64

A penguin is around 4 kg LOLLLLLL

Only one variable

df.bill_depth_mm.mean()
17.151169590643274

Median#

df.median()
/tmp/ipykernel_1302/530051474.py:1: FutureWarning: The default value of numeric_only in DataFrame.median is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.median()
bill_length_mm         44.45
bill_depth_mm          17.30
flipper_length_mm     197.00
body_mass_g          4050.00
year                 2008.00
dtype: float64

Mode#

df.mode()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Biscoe 41.1 17.0 190.0 3800.0 male 2009

Works for categorical too

Measures of dispersion#

Maximum value for each variable?#

df.max(numeric_only=True)           #numeric_only=True excludes the categorical variables(makes sense)
bill_length_mm         59.6
bill_depth_mm          21.5
flipper_length_mm     231.0
body_mass_g          6300.0
year                 2009.0
dtype: float64

Minimum value for each variable?#

df.min(numeric_only=True)
bill_length_mm         32.1
bill_depth_mm          13.1
flipper_length_mm     172.0
body_mass_g          2700.0
year                 2007.0
dtype: float64

Range of each variable#

df.max(numeric_only=True) - df.min(numeric_only=True)
bill_length_mm         27.5
bill_depth_mm           8.4
flipper_length_mm      59.0
body_mass_g          3600.0
year                    2.0
dtype: float64

Standart deviation#

df.std()
/tmp/ipykernel_1302/3390915376.py:1: FutureWarning: The default value of numeric_only in DataFrame.std is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.std()
bill_length_mm         5.459584
bill_depth_mm          1.974793
flipper_length_mm     14.061714
body_mass_g          801.954536
year                   0.818356
dtype: float64

Interquartile range?#

The IQR equals Q3-Q1, and it holds the 50% of the data around the mean

Q1

df.quantile(0.25)
/tmp/ipykernel_1302/3656653379.py:1: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.quantile(0.25)
bill_length_mm         39.225
bill_depth_mm          15.600
flipper_length_mm     190.000
body_mass_g          3550.000
year                 2007.000
Name: 0.25, dtype: float64

Q3

df.quantile(0.75)
/tmp/ipykernel_1302/3799946287.py:1: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.quantile(0.75)
bill_length_mm         48.5
bill_depth_mm          18.7
flipper_length_mm     213.0
body_mass_g          4750.0
year                 2009.0
Name: 0.75, dtype: float64

IQR

(
    df
    .quantile(q=[0.75, 0.50, 0.25])
    .transpose()
    .rename_axis("variable")
    .reset_index()
    .assign(
        iqr=lambda df: df[0.75] - df[0.25]
    )
)
/tmp/ipykernel_1302/4097684495.py:2: FutureWarning: The default value of numeric_only in DataFrame.quantile is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df
variable 0.75 0.5 0.25 iqr
0 bill_length_mm 48.5 44.45 39.225 9.275
1 bill_depth_mm 18.7 17.30 15.600 3.100
2 flipper_length_mm 213.0 197.00 190.000 23.000
3 body_mass_g 4750.0 4050.00 3550.000 1200.000
4 year 2009.0 2008.00 2007.000 2.000

Probabilities#

Probability Mass function#

Probability for each value

empiricaldist.Pmf.from_seq(df.flipper_length_mm)
probs
172.0 0.002924
174.0 0.002924
176.0 0.002924
178.0 0.011696
179.0 0.002924
180.0 0.014620
181.0 0.020468
182.0 0.008772
183.0 0.005848
184.0 0.020468
185.0 0.026316
186.0 0.020468
187.0 0.046784
188.0 0.017544
189.0 0.020468
190.0 0.064327
191.0 0.038012
192.0 0.020468
193.0 0.043860
194.0 0.014620
195.0 0.049708
196.0 0.029240
197.0 0.029240
198.0 0.023392
199.0 0.017544
200.0 0.011696
201.0 0.017544
202.0 0.011696
203.0 0.014620
205.0 0.008772
206.0 0.002924
207.0 0.005848
208.0 0.023392
209.0 0.014620
210.0 0.040936
211.0 0.005848
212.0 0.020468
213.0 0.017544
214.0 0.017544
215.0 0.035088
216.0 0.023392
217.0 0.017544
218.0 0.014620
219.0 0.014620
220.0 0.023392
221.0 0.014620
222.0 0.017544
223.0 0.005848
224.0 0.008772
225.0 0.011696
226.0 0.002924
228.0 0.011696
229.0 0.005848
230.0 0.020468
231.0 0.002924

Then, if i want to know that is the specific probability of getting a value:

empiricaldist.Pmf.from_seq(df.flipper_length_mm)(190)
0.06432748538011696

Empirical Cumulative Distribution Function#

This one returns the cumulative probability based on the dataset

plt.figure(figsize=(6, 6))
sns.ecdfplot(data=df, x="flipper_length_mm");
../_images/66e412ca0c0039af3855304eba8e9595da02ccee3118c77a3563d3a39f37b13e.png

Probabilities of getting a penguin with 200 or less flipper_length?

empiricaldist.Cdf.from_seq(df.flipper_length_mm)(200)
array(0.56725146)

See the chart below for a better interpretation

plt.figure(figsize=(6, 6))

empiricaldist.Cdf.from_seq(df.flipper_length_mm).plot()

q=200
p = empiricaldist.Cdf.from_seq(df.flipper_length_mm)(200)

plt.vlines(
    x=q,
    ymin=0,
    ymax=p,
    color="black",
    linestyle="dashed"
)

plt.hlines(
    y=p,
    xmin=empiricaldist.Cdf.from_seq(df.flipper_length_mm).qs[0],
    xmax=q,
    color="black",
    linestyle="dashed"
)

plt.plot(q, p, "ro");
../_images/566f3bb0680c821b4f0cf0ab24939fe168c47eb97239a39bc9379956fdbc70b8.png

Correlations#

Pairplot#

In this case, please remove the year column, despite being a numeric column, it is not adding value to the correlation analysis

sns.pairplot(data=df.drop(["year"], axis=1));
../_images/f1733658e2825886bb3cb71846c0d812f656ec524aef4c650236310893bcc52f.png

Pairplot per specie#

sns.pairplot(data=df.drop(["year"], axis=1), hue="species", palette=penguin_color);
../_images/d21836576320161a18823e6a8a04f086c1e4ddb634edc3d7c9536f0d5a1f04b3.png

Swarmplot#

This one can be used to associate a categorical variable with a numerical:

plt.figure(figsize=(6, 6))
sns.swarmplot(
    data=df,
    x="island",
    y="body_mass_g",
    hue="species",
    palette=penguin_color
);
../_images/7d307b7a215533b707cd6761ce527f991b4c9a0f7d67f786c7f9c1e425b86c41.png
  • There doesn’t seem to be a relation among the penguin body mass and the island where it is located

  • there is a relation among the body mass and the specie

Is there a linear correlation among any of our variables?#

df.drop(["year"], axis=1).corr()          #first remove the year column and then run
/tmp/ipykernel_1302/2116061803.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  df.drop(["year"], axis=1).corr()          #first remove the year column and then run
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm 1.000000 -0.235053 0.656181 0.595110
bill_depth_mm -0.235053 1.000000 -0.583851 -0.471916
flipper_length_mm 0.656181 -0.583851 1.000000 0.871202
body_mass_g 0.595110 -0.471916 0.871202 1.000000
  • Flipper length and body mass have a 0.8712 correlation coef, high

let’s visualize that into a matrix

plt.figure(figsize=(6, 6))
sns.heatmap(
    data=df.drop(["year"], axis=1).corr(),
    cmap=sns.diverging_palette(20, 230, as_cmap=True),        #nice color palette
    center=0,                                                 #centering the color legend
    vmin=-1,                                                  #set min corr value to -1
    vmax=1,                                                   #set max corr value to 1
    linewidths=0.5,
    annot=True
)
/tmp/ipykernel_1302/2892716928.py:3: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  data=df.drop(["year"], axis=1).corr(),
<AxesSubplot: >
../_images/7c6a1d51571eb247cac0ec80c3064cf9033379a0e5e2435dc1c1c4594c45fa4f.png
sns.clustermap(
    data=df.drop(["year"], axis=1).corr(),
    cmap=sns.diverging_palette(20, 230, as_cmap=True),        #nice color palette
    center=0,                                                 #centering the color legend
    vmin=-1,                                                  #set min corr value to -1
    vmax=1,                                                   #set max corr value to 1
    linewidths=0.5,
    annot=True,
    figsize=(6,6)
);
/tmp/ipykernel_1302/2652653307.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  data=df.drop(["year"], axis=1).corr(),
../_images/7a42e74d9f3a6816de77daa489665f46bade5d15c6960f5ce15ff814db497624.png

How can i represent a categorical variable as a discrete numerical?#

Let’s do it with sex, set female to 0 and male to 1

df = df.assign(
    numeric_sex=lambda df : df.sex.replace(["female", "male"], [0, 1])
)

df_clean = df.dropna()
sns.clustermap(
    data=df.drop(["year"], axis=1).corr(),
    cmap=sns.diverging_palette(20, 230, as_cmap=True),        #nice color palette
    center=0,                                                 #centering the color legend
    vmin=-1,                                                  #set min corr value to -1
    vmax=1,                                                   #set max corr value to 1
    linewidths=0.5,
    annot=True,
    figsize=(6,6)
);
/tmp/ipykernel_1302/2652653307.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
  data=df.drop(["year"], axis=1).corr(),
../_images/b0091202a69611d0de171f9e55703ea059da9302b61c0f3ffa6f8ad3680b6088.png

Look like linear correlation is low among sex and the other variables

Multiple regression#

I forgot the scale to weigh the penguins, what is the best way to capture that data?

Let’s create many linear regression models and compare them

  • Use the df_clean, which contains the dataset without null values, just to avoid errors

Model 1#

let’s create a model to predict the “body_mass” using the variable “bill_depth”

#Creating the model

model_1 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_1.summary()
OLS Regression Results
Dep. Variable: body_mass_g R-squared: 0.347
Model: OLS Adj. R-squared: 0.345
Method: Least Squares F-statistic: 176.2
Date: Mon, 09 Jan 2023 Prob (F-statistic): 1.54e-32
Time: 21:34:02 Log-Likelihood: -2629.1
No. Observations: 333 AIC: 5262.
Df Residuals: 331 BIC: 5270.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 388.8452 289.817 1.342 0.181 -181.271 958.961
bill_length_mm 86.7918 6.538 13.276 0.000 73.931 99.652
Omnibus: 6.141 Durbin-Watson: 0.849
Prob(Omnibus): 0.046 Jarque-Bera (JB): 4.899
Skew: -0.197 Prob(JB): 0.0864
Kurtosis: 2.555 Cond. No. 360.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 2#

Now let’s add a second variable “bill_depth_mm”

#Creating the model

model_2 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm + bill_depth_mm",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_2.summary()
OLS Regression Results
Dep. Variable: body_mass_g R-squared: 0.467
Model: OLS Adj. R-squared: 0.464
Method: Least Squares F-statistic: 144.8
Date: Mon, 09 Jan 2023 Prob (F-statistic): 7.04e-46
Time: 21:34:02 Log-Likelihood: -2595.2
No. Observations: 333 AIC: 5196.
Df Residuals: 330 BIC: 5208.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3413.4519 437.911 7.795 0.000 2552.002 4274.902
bill_length_mm 74.8126 6.076 12.313 0.000 62.860 86.765
bill_depth_mm -145.5072 16.873 -8.624 0.000 -178.699 -112.315
Omnibus: 2.839 Durbin-Watson: 1.798
Prob(Omnibus): 0.242 Jarque-Bera (JB): 2.175
Skew: -0.000 Prob(JB): 0.337
Kurtosis: 2.604 Cond. No. 644.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Model 3#

Now we can add the “flipper_length_mm” and see what happens

#Creating the model

model_3 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_3.summary()
OLS Regression Results
Dep. Variable: body_mass_g R-squared: 0.764
Model: OLS Adj. R-squared: 0.762
Method: Least Squares F-statistic: 354.9
Date: Mon, 09 Jan 2023 Prob (F-statistic): 9.26e-103
Time: 21:34:02 Log-Likelihood: -2459.8
No. Observations: 333 AIC: 4928.
Df Residuals: 329 BIC: 4943.
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -6445.4760 566.130 -11.385 0.000 -7559.167 -5331.785
bill_length_mm 3.2929 5.366 0.614 0.540 -7.263 13.849
bill_depth_mm 17.8364 13.826 1.290 0.198 -9.362 45.035
flipper_length_mm 50.7621 2.497 20.327 0.000 45.850 55.675
Omnibus: 5.596 Durbin-Watson: 1.982
Prob(Omnibus): 0.061 Jarque-Bera (JB): 5.469
Skew: 0.312 Prob(JB): 0.0649
Kurtosis: 3.068 Cond. No. 5.44e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.44e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Model 4#

#Creating the model

model_4 = (
    smf.ols(
        formula="body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm + C(sex)",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_4.summary()
OLS Regression Results
Dep. Variable: body_mass_g R-squared: 0.823
Model: OLS Adj. R-squared: 0.821
Method: Least Squares F-statistic: 381.3
Date: Mon, 09 Jan 2023 Prob (F-statistic): 6.28e-122
Time: 21:34:02 Log-Likelihood: -2411.8
No. Observations: 333 AIC: 4834.
Df Residuals: 328 BIC: 4853.
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -2288.4650 631.580 -3.623 0.000 -3530.924 -1046.006
C(sex)[T.male] 541.0285 51.710 10.463 0.000 439.304 642.753
bill_length_mm -2.3287 4.684 -0.497 0.619 -11.544 6.886
bill_depth_mm -86.0882 15.570 -5.529 0.000 -116.718 -55.459
flipper_length_mm 38.8258 2.448 15.862 0.000 34.011 43.641
Omnibus: 2.598 Durbin-Watson: 1.843
Prob(Omnibus): 0.273 Jarque-Bera (JB): 2.125
Skew: 0.062 Prob(JB): 0.346
Kurtosis: 2.629 Cond. No. 7.01e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.01e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Model 5#

Let’s now try with the “flipper_length_mm” alone

#Creating the model

model_5 = (
    smf.ols(
        formula="body_mass_g ~ flipper_length_mm + C(sex)",
        data=df_clean
    )
    .fit()
)

#Describing the model

model_5.summary()
OLS Regression Results
Dep. Variable: body_mass_g R-squared: 0.806
Model: OLS Adj. R-squared: 0.805
Method: Least Squares F-statistic: 684.8
Date: Mon, 09 Jan 2023 Prob (F-statistic): 3.53e-118
Time: 21:34:02 Log-Likelihood: -2427.2
No. Observations: 333 AIC: 4860.
Df Residuals: 330 BIC: 4872.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -5410.3002 285.798 -18.931 0.000 -5972.515 -4848.085
C(sex)[T.male] 347.8503 40.342 8.623 0.000 268.491 427.209
flipper_length_mm 46.9822 1.441 32.598 0.000 44.147 49.817
Omnibus: 0.262 Durbin-Watson: 1.710
Prob(Omnibus): 0.877 Jarque-Bera (JB): 0.376
Skew: 0.051 Prob(JB): 0.829
Kurtosis: 2.870 Cond. No. 2.95e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.95e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Note that only one variable explains the 76% of the variance.

You decide if sticking with one variable and getting 76% of variance, or stick with 4 and get 80%.

Multiple Regression Model Comparisons#

Creation of a results table#

model_results = pd.DataFrame(
    dict(
        actual_value = df_clean.body_mass_g,
        prediction_model_1 = model_1.predict(),
        prediction_model_2 = model_2.predict(),
        prediction_model_3 = model_3.predict(),
        prediction_model_4 = model_4.predict(),
        prediction_model_5 = model_5.predict(),
        species = df_clean.species,
        sex = df_clean.sex
    )
)

model_results
actual_value prediction_model_1 prediction_model_2 prediction_model_3 prediction_model_4 prediction_model_5 species sex
0 3750.0 3782.402961 3617.641192 3204.761227 3579.136946 3441.323750 Adelie male
1 3800.0 3817.119665 3836.725580 3436.701722 3343.220772 3328.384372 Adelie female
2 3250.0 3886.553073 3809.271371 3906.897032 3639.137335 3751.223949 Adelie female
4 3450.0 3574.102738 3350.786581 3816.705772 3457.954243 3657.259599 Adelie female
5 3650.0 3799.761313 3356.140070 3696.168128 3764.536023 3864.163327 Adelie male
... ... ... ... ... ... ... ... ...
339 4000.0 5231.825347 4706.954140 4599.187485 4455.022405 4662.860306 Chinstrap male
340 3400.0 4164.286703 4034.121055 4274.552753 3894.857519 4080.099176 Chinstrap female
341 3775.0 4693.716437 4475.927353 3839.563668 4063.639819 4005.109853 Chinstrap male
342 4100.0 4797.866549 4449.296758 4720.740455 4652.013882 4803.806832 Chinstrap male
343 3775.0 4745.791493 4448.061337 4104.268240 3672.299099 3892.170475 Chinstrap female

333 rows × 8 columns

Comparing predicted values vs Actual values (ECDFs)#

plt.figure(figsize=(6, 6))
sns.ecdfplot(
    data=model_results
);
../_images/6c34c284d66e96f51964e87ade83e43b2c7a68bd45c8c4d7e6e9333889710538.png

Now, note that the BLUE LINE ARE THE ACTUAL VALUES

Therefore, the line that is closer to the model, would be the better model

Comparing predicted values distributions vs Original values distributions(Kdeplot)#

plt.figure(figsize=(6, 6))
sns.kdeplot(
    data=model_results
);
../_images/effe5cb70103f43b577ad222a453df68906b32e8cdea657891e12a32fa9c5cc6.png

Looks like the purple line (model 4) fits better, but that contains 4 variables.

The model 5 also fist really great with only two variables

Bonus: Pandas Profiling Report#

import palmerpenguins
import pandas as pd
import pandas_profiling as pp

df = palmerpenguins.load_penguins()

pp.ProfileReport(df)

Created in deepnote.com Created in Deepnote